All of this is new and quite promising; I do not know anything about the subject, and learning will probably be challenging. However, I am eager to start. I experienced some difficulties in this first task, as I am unfamiliar with the whole thing, but in the end I managed to overcome those problems.
I would like to learn about processing large datasets to be used in my research. My specialization is in political science and organizations, so I would like to learn mainly about working with statistics and data visualization.
My supervisor warmly recommended that I enrol in this course.
First of all, I read in the new data frame with the function read.table().
Then I use the function dim() to show the dimensions of the data frame, which has 166 observations and 7 variables, as explained in the above-mentioned data wrangling section:
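A minimal sketch of these two steps (the file name and separator are assumptions, not taken from the original):
learning2014 <- read.table("learning2014.txt", header = TRUE, sep = ",")  # read the wrangled data
dim(learning2014)  # number of rows and columns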
## [1] 166 7
Typing str() shows the structure of the data frame:
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
To visualize the data, I call install.packages() to install the visualization packages ggplot2 and GGally, and then load them into the session with library().
install.packages("ggplot2")
install.packages("GGally")
library(ggplot2)
library(GGally)
The libraries are loaded:
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
With the quick plotting function pairs() we can draw a scatterplot matrix of all possible scatterplots between the columns of the data frame. Different colours are used for males and females.
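A minimal sketch of such a call (the column selection and colouring are assumptions):
pairs(learning2014[-1], col = learning2014$gender)  # drop gender from the matrix, colour points by gender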
This second plot matrix is more advanced; it is made with ggpairs().
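One plausible way to produce it (the mapping and the histogram options are assumptions based on common GGally usage):
ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))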
The summary of the variables:
## gender Age Attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf Points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
Females are almost twice as numerous as males, while males show a wider age range. The summary plot suggests a notable correlation for surf vs deep and for points vs attitude.
I have chosen the three variables "attitude", "deep learning" and "surface learning", together with the target variable "exam points", to fit a regression model.
Drawing a plot matrix with ggpairs().
Fitting the regression model with the three explanatory variables and printing its summary:
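A sketch of the fit (the object name my_model is my own choice; the formula matches the call shown below):
my_model <- lm(Points ~ Attitude + deep + surf, data = learning2014)
summary(my_model)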
##
## Call:
## lm(formula = Points ~ Attitude + deep + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9168 -3.1487 0.3667 3.8326 11.3519
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.3551 4.7124 3.895 0.000143 ***
## Attitude 3.4661 0.5766 6.011 1.18e-08 ***
## deep -0.9485 0.7903 -1.200 0.231815
## surf -1.0911 0.8360 -1.305 0.193669
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.313 on 162 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1876
## F-statistic: 13.7 on 3 and 162 DF, p-value: 5.217e-08
MISSING
The adjusted R-squared of 0.1876 indicates that the model explains the data poorly. Only Attitude is statistically significant.
A multiple linear regression model rests on a few assumptions:
a linear relationship between the target variable and the explanatory variables, usually revealed by scatterplots;
multivariate normality, meaning that the residuals are normally distributed; the QQ-plot can reveal this;
the absence of multicollinearity, in other words the explanatory variables are not highly correlated with each other;
homoscedasticity, or constant variance of the errors: the error terms have similar variance across the values of the explanatory variables. A plot of standardized residuals versus predicted values shows whether the points are equally spread across the range of predicted values.
The diagnostic plots delivered the following observations:
The residuals vs fitted values plot is used to check the assumption of a linear relationship. A horizontal line without distinct patterns indicates a linear relationship; in this case the red line is more or less horizontal at zero, so a linear relationship can be assumed.
The normal QQ-plot checks the normality of the residuals. A good indication is that the residual points follow the straight dashed line. This is mostly the case here, so normality can also be assumed.
The residuals vs leverage plot identifies the impact of single observations on the model. Influential points lie in the upper or lower right corner, where they can pull the regression line. In this case the points sit on the left side of the plot, so we can say that no single observation has exceptional leverage.
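These diagnostics can be drawn directly from the fitted model; a sketch, assuming the model object my_model from above:
par(mfrow = c(2, 2))                 # 2 x 2 plotting layout
plot(my_model, which = c(1, 2, 5))   # residuals vs fitted, normal QQ, residuals vs leverage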
In this chapter we analyze a dataset resulting from the data wrangling of another dataset about students' performance in mathematics and Portuguese language in secondary school. Here are the variables of the dataset we are going to analyze:
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
Altogether, the dataset has the following dimensions:
## [1] 382 35
I choose four variables of interest on which I build some hypotheses:
| VARIABLE CHOSEN | RELATED HYPOTHESIS |
|---|---|
| "goout" | H1: Students who go out more have higher alcohol consumption. |
| "freetime" | H2: Students who have more free time are more prone to drink. |
| "studytime" | H3: The more the study time, the less a student drinks. |
| "romantic" | H4: Given the courting dynamics, romantics drink less. |
Let's start with a graphical overview of the distribution of the dataset variables in relation to alcohol consumption. After installing and loading "tidyr", "dplyr", and "ggplot2", and piping the output of glimpse() with the pipe operator %>% into the plot-generating function ggplot(), we get the following plot:
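A hedged sketch of one way to produce such an overview (assuming the joined data frame is named alc; the gather()-based reshaping is my assumption, mirroring the approach used later for the tea data):
library(tidyr); library(dplyr); library(ggplot2)
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()  # one bar panel per variable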
Let’s now focus on the cross-tabulations for the specific interest variables on which hypotheses were posed:
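A sketch of how such cross-tabulations can be produced with dplyr (assuming mean_grade is the mean of the final grade G3):
alc %>% group_by(goout, alc_use) %>% summarise(count = n(), mean_grade = mean(G3))  # repeat for freetime, studytime, romantic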
## # A tibble: 38 x 4
## # Groups: goout [5]
## goout alc_use count mean_grade
## <int> <dbl> <int> <dbl>
## 1 1 1 14 11.1
## 2 1 1.5 3 7
## 3 1 2 2 13.5
## 4 1 2.5 2 12
## 5 1 3.5 1 10
## 6 2 1 49 12.6
## 7 2 1.5 21 12.1
## 8 2 2 14 10.3
## 9 2 2.5 8 12.4
## 10 2 3 4 10.8
## # ... with 28 more rows
## # A tibble: 37 x 4
## # Groups: freetime [5]
## freetime alc_use count mean_grade
## <int> <dbl> <int> <dbl>
## 1 1 1 10 10.3
## 2 1 1.5 1 16
## 3 1 2 4 12.2
## 4 1 3 1 10
## 5 1 4 1 8
## 6 2 1 18 13.3
## 7 2 1.5 24 12.9
## 8 2 2 7 10.3
## 9 2 2.5 8 11.8
## 10 2 3 6 10.7
## # ... with 27 more rows
## # A tibble: 30 x 4
## # Groups: studytime [4]
## studytime alc_use count mean_grade
## <int> <dbl> <int> <dbl>
## 1 1 1 21 12.2
## 2 1 1.5 20 10
## 3 1 2 17 10.1
## 4 1 2.5 10 11.5
## 5 1 3 12 10.1
## 6 1 3.5 11 9.09
## 7 1 4 4 13
## 8 1 5 5 9.6
## 9 2 1 72 11.7
## 10 2 1.5 35 12.1
## # ... with 20 more rows
## # A tibble: 18 x 4
## # Groups: romantic [2]
## romantic alc_use count mean_grade
## <fct> <dbl> <int> <dbl>
## 1 no 1 98 12.3
## 2 no 1.5 47 11.8
## 3 no 2 35 11.3
## 4 no 2.5 32 11.8
## 5 no 3 25 10.4
## 6 no 3.5 13 11.1
## 7 no 4 6 10.3
## 8 no 4.5 1 10
## 9 no 5 4 10
## 10 yes 1 42 11.1
## 11 yes 1.5 22 11.2
## 12 yes 2 24 11.2
## 13 yes 2.5 12 11.7
## 14 yes 3 7 8.71
## 15 yes 3.5 4 7.5
## 16 yes 4 3 10
## 17 yes 4.5 2 11
## 18 yes 5 5 11.6
Using logistic regression, we statistically explore the relationship between the four selected variables and the binary high/low alcohol consumption variable as the target.
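A sketch of the fit (the object name m is my own choice; the formula matches the call shown below):
m <- glm(high_use ~ goout + freetime + studytime + romantic, data = alc, family = "binomial")
summary(m)   # model summary
coef(m)      # coefficients on the log-odds scale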
##
## Call:
## glm(formula = high_use ~ goout + freetime + studytime + romantic,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6927 -0.7743 -0.5479 1.0022 2.6060
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.31564 0.62593 -3.700 0.000216 ***
## goout 0.72871 0.12071 6.037 1.57e-09 ***
## freetime 0.08366 0.13415 0.624 0.532863
## studytime -0.58629 0.16508 -3.552 0.000383 ***
## romanticyes -0.18301 0.26722 -0.685 0.493442
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 399.98 on 377 degrees of freedom
## AIC: 409.98
##
## Number of Fisher Scoring iterations: 4
## (Intercept) goout freetime studytime romanticyes
## -2.31564309 0.72871166 0.08366382 -0.58628767 -0.18300551
The odds ratio (OR) is obtained by dividing the odds of "success" (Y = 1) for students who have property X by the odds of "success" for those who do not. The OR quantifies the relationship between X and Y: an OR higher than 1 indicates that X is positively associated with "success". The odds ratios can also be obtained by exponentiating the coefficients of a logistic regression model.
Computation of the odds ratio (OR)
Computation of the confidence intervals for the coefficients with the function confint(); the values are exponentiated with exp() in the next step.
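A sketch of this step (assuming the model object m from above):
confint(m)   # profile-likelihood confidence intervals on the log-odds scale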
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -3.5691044 -1.1092870
## goout 0.4982923 0.9725916
## freetime -0.1795761 0.3477794
## studytime -0.9207772 -0.2716715
## romanticyes -0.7148612 0.3353712
Obtaining the odds ratios with their confidence intervals by using cbind():
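A sketch of this step (assuming m as above):
OR <- exp(coef(m))      # odds ratios
CI <- exp(confint(m))   # exponentiated confidence intervals
cbind(OR, CI)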
## OR 2.5 % 97.5 %
## (Intercept) 0.09870269 0.02818108 0.3297940
## goout 2.07240892 1.64590816 2.6447898
## freetime 1.08726331 0.83562438 1.4159199
## studytime 0.55638895 0.39820941 0.7621045
## romanticyes 0.83276356 0.48926002 1.3984594
Values greater than 1 are seen across the whole interval for goout, for most of freetime (except the 2.5% bound), and at the 97.5% bound for romantic; there the association is positive. These results mostly confirmed my hypotheses, apart from studytime.
First we use the function predict() to compute the probability of high use; after adding these probabilities to the dataset 'alc', we predict the classes and tabulate the target variable against the predictions:
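A sketch of these steps (assuming the model object m from above and a 0.5 classification threshold):
library(dplyr)
alc <- mutate(alc, probability = predict(m, type = "response"))  # predicted probabilities
alc <- mutate(alc, prediction = probability > 0.5)               # predicted class
table(high_use = alc$high_use, prediction = alc$prediction)                                    # counts
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()  # proportions with margins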
## goout freetime studytime romantic probability prediction
## 373 2 3 1 no 0.23263069 FALSE
## 374 3 4 1 no 0.40585185 FALSE
## 375 3 3 3 no 0.16282192 FALSE
## 376 3 4 1 yes 0.36258869 FALSE
## 377 2 4 3 no 0.09258880 FALSE
## 378 4 3 2 no 0.42009575 FALSE
## 379 2 2 2 no 0.13429940 FALSE
## 380 1 1 2 no 0.06441395 FALSE
## 381 5 4 1 no 0.74578989 TRUE
## 382 1 4 1 no 0.13722123 FALSE
## prediction
## high_use FALSE TRUE
## FALSE 247 21
## TRUE 73 41
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.64659686 0.05497382 0.70157068
## TRUE 0.19109948 0.10732984 0.29842932
## Sum 0.83769634 0.16230366 1.00000000
Accuracy measures the performance of a binary classifier as the average proportion of correctly classified observations. Conversely, the mean proportion of incorrectly classified observations can be seen as a penalty (loss) function of the classifier: the smaller, the better. In this section we first define a loss function loss_func(), and then apply it to probability = 0, probability = 1, and finally to the prediction probabilities in alc.
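A sketch of such a loss function and the three calls (column names as above):
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5   # TRUE when the observation is misclassified
  mean(n_wrong)                        # proportion of misclassified observations
}
loss_func(class = alc$high_use, prob = 0)
loss_func(class = alc$high_use, prob = 1)
loss_func(class = alc$high_use, prob = alc$probability)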
## [1] 0.2984293
## [1] 0.7015707
## [1] 0.2460733
The first and third calls give a clearly lower error than the case probability = 1; the classifier works better than guessing.
Cross-validation is a technique for assessing how the results of a statistical analysis generalize to an independent dataset. In cross-validation, a sample of data is partitioned into complementary subsets (a larger training set and a smaller testing set); the analysis is performed on the former and the results are validated on the latter. Here we use K = 10 folds.
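A sketch using cv.glm() from the boot package (object names are my own):
library(boot)
cv_10 <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)          # 10-fold CV
cv_10$delta[1]                                                             # average prediction error
cv_loo <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = nrow(alc))  # leave-one-out CV
cv_loo$delta[1]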
## [1] 0.2460733
With leave-one-out cross validation:
## [1] 0.2460733
with 10-fold cross validation:
## [1] 0.2486911
Ten-fold cross-validation shows a higher prediction error on the testing data than on the training data. It is nevertheless lower than the 0.26 obtained in the DataCamp exercise.
At first I use a logistic regression model with 22 predictors.
## [1] 0.2460733
The function is performed with leave-one-out cross validation.
## [1] 0.2539267
Here the result is given by ten-fold cross validation.
## [1] 0.2722513
With 15 predictors
The function is performed with leave-one-out cross validation.
## [1] 0.2696335
Here the result is given by ten-fold cross validation.
## [1] 0.2617801
With 10 predictors
The function is performed with leave-one-out cross validation.
## [1] 0.2670157
Here the result is given by ten-fold cross validation.
## [1] 0.2565445
With 5 predictors
The function is performed with leave-one-out cross validation.
## [1] 0.2460733
Here the result is given by ten-fold cross validation.
## [1] 0.2460733
The Boston dataset is about housing values in the suburbs of Boston. I use the functions str() and dim() to explore the dataset. Here is its structure:
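A sketch of loading and exploring the data (Boston ships with the MASS package):
library(MASS)
data("Boston")
str(Boston)
dim(Boston)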
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
And here are its dimensions:
## [1] 506 14
Let’s have a look at the summary() of the variables:
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Using the function pairs() we obtain the following graphical overview:
From the plot above it is a bit difficult to see the relations between variables, so let's try something else, for instance a correlation plot. With the function corrplot() we obtain a visual way to look at correlations. First we need to calculate the correlation matrix with cor():
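A sketch of this step (rounding to two digits, as in the matrix shown below):
cor_matrix <- round(cor(Boston), digits = 2)
cor_matrix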
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
Now that we have the matrix, rounded to two digits, we can create the correlation plot with corrplot(). Here is how it looks:
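A sketch of the call (the ordering method matches the one mentioned below; the other display options are assumptions):
library(corrplot)
corrplot(cor_matrix, method = "circle", type = "upper", order = "hclust", tl.cex = 0.6)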
corrplot() gives us a graphical overview of Pearson's correlation coefficients calculated with cor(). This measure summarizes the strength of a linear association: a value of 0 means that two variables are uncorrelated, while a value of -1 (in red) or 1 (in blue) shows that they are perfectly related.
As we can see, the size and colour intensity of the dots visually show the strength of the linear associations. I used order = "hclust" as the ordering method for this correlation matrix, as it makes the matrix easier to read. Among the strongest negative correlations are dis-nox, dis-indus, dis-age, lstat-rm, and lstat-medv. Among the strongest positive correlations we find tax-rad, tax-indus, nox-indus, and nox-age. Overall, only the variable chas seems to have little or no correlation with the others.
We scale the dataset by using the scale() function, then we can see the scaled variables with summary():
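A sketch of this step (the object name boston_scaled matches the one used below):
boston_scaled <- scale(Boston)
summary(boston_scaled)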
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
The function scale() operates on the variables by subtracting the column means from the corresponding columns and dividing the differences by the column standard deviations. Here it was possible to scale the whole dataset because it contains only numerical values.
The class of the boston_scaled object is:
## [1] "matrix"
So, to complete the procedure, we convert the object into a data frame.
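A sketch of these two steps:
class(boston_scaled)                           # "matrix"
boston_scaled <- as.data.frame(boston_scaled)  # convert to data frame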
To create the categorical variable, we use the function cut() together with quantile() to divide our factor variable by quantiles into four crime-rate categories:
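A sketch of this step (the category labels match the table below; replacing crim with the new factor is an assumed follow-up step):
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
table(crime)
boston_scaled <- dplyr::select(boston_scaled, -crim)   # drop the continuous variable (assumption)
boston_scaled <- data.frame(boston_scaled, crime)      # add the categorical one (assumption)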
## crime
## low med_low med_high high
## 127 126 126 127
Train and test sets. At first we use nrow() to count the number of rows in the dataset:
## [1] 506
Then, with ind <- sample(), we randomly choose 80% of the rows to create the train set; the remaining rows form the test set.
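A sketch of the split (object names are my own):
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)   # indices of the training rows
train <- boston_scaled[ind, ]
test <- boston_scaled[-ind, ]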
In this section we fit a linear discriminant analysis on the train set, using the categorical crime rate as the target variable and all the other variables as predictors. Here we can see the plot:
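A sketch of the fit and its plot (the plotting options are assumptions):
lda.fit <- lda(crime ~ ., data = train)
plot(lda.fit, dimen = 2, col = as.numeric(train$crime), pch = as.numeric(train$crime))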
We will now use the LDA model to predict the classes of the test data, but before that we save the crime categories from the test set and then remove the categorical crime variable from it.
Here is the cross-tabulation of the predictions against the crime categories from the test set:
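A sketch of the prediction step (assuming lda.fit and test from above):
correct_classes <- test$crime                 # save the true categories
test <- dplyr::select(test, -crime)           # remove the target from the test data
lda.pred <- predict(lda.fit, newdata = test)  # predict the classes
table(correct = correct_classes, predicted = lda.pred$class)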
## predicted
## correct low med_low med_high high
## low 15 9 2 0
## med_low 7 16 2 0
## med_high 1 13 16 1
## high 0 0 0 20
MISSING
To measure the distances between the observations, we first standardize the dataset with data.Normalization() using type = "n1".
Then we compute the distances between observations with the function dist(), which by default uses the Euclidean distance, the most common distance measure; we also try the Manhattan method:
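A sketch of these steps (data.Normalization() comes from the clusterSim package; object names are my own):
library(clusterSim)
boston_std <- data.Normalization(Boston, type = "n1")   # standardization
dist_eu <- dist(boston_std)                             # Euclidean distances (default)
dist_man <- dist(boston_std, method = "manhattan")      # Manhattan distances
summary(dist_eu); summary(dist_man)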
After that, we calculate and visualize the total within-cluster sum of squares, using set.seed(123) so that the random choice of initial cluster centres is reproducible, and setting the maximum number of clusters to 10.
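A sketch of the elbow computation (assuming boston_std from above):
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k) kmeans(boston_std, k)$tot.withinss)  # total WCSS for k = 1..10
plot(1:k_max, twcss, type = "b")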
Looking at the elbow of the curve, I choose to go with three centres.
Then we run kmeans(); I divide the plot into four panels to improve clarity:
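A sketch of the clustering and its visualization (splitting the columns into subsets only illustrates the multi-panel idea):
km <- kmeans(boston_std, centers = 3)
pairs(boston_std[, 1:7], col = km$cluster)
pairs(boston_std[, 8:14], col = km$cluster)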
To be sure, I also try something different, for instance five centres.
MISSING
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
Here, as before, is the procedure for k-means with more than two clusters:
Again, we run kmeans():
And here is the LDA model; since the variable chas appeared to be constant within groups, I removed it:
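A hypothetical sketch of this step (the object names boston_scaled2 and km2 and the choice of three clusters are assumptions):
boston_scaled2 <- as.data.frame(scale(Boston))                  # re-scale the full dataset
km2 <- kmeans(boston_scaled2, centers = 3)                      # k-means with more than two clusters
lda.fit2 <- lda(km2$cluster ~ . - chas, data = boston_scaled2)  # LDA with the clusters as target, chas dropped
plot(lda.fit2, dimen = 2, col = km2$cluster)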
The most influential variable as cluster linear separator is the variable tax.
In this section we run the code on the scaled train dataset to produce two 3D plots.
In the first 3D plot the colour is given by train$crime:
In the second 3D plot the colour is defined by km$cluster:
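A sketch of how such 3D plots can be drawn with plotly (assuming the LDA fit lda.fit and the train set from above):
library(plotly)
model_predictors <- dplyr::select(train, -crime)
matrix_product <- as.data.frame(as.matrix(model_predictors) %*% lda.fit$scaling)  # project the train data onto the discriminants
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
        type = "scatter3d", mode = "markers", color = train$crime)
# the second plot is analogous, with color = km$cluster from a k-means run on the same train data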
To visualize the data, we use both ggpairs() and corrplot():
library(GGally); library(corrplot); library(dplyr)
ggpairs(human_)
cor(human_) %>% corrplot(method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6, order = "hclust")
The observations are unevenly distributed; not all of them follow a normal distribution, as the plot suggests, which may require standardization later on. Among the strongest inverse correlations we logically have Mat.Mor and Life.Exp; Life.Exp and Ado.Birth also show an inverse correlation. Edu.Exp, GNI, and Edu2.FM all have a negative correlation with those two variables. As for the direct correlations, the strongest is between Ado.Birth and Mat.Mor, which shows that the adolescent birth rate is positively correlated with maternal mortality. Then, logically, Life.Exp has a strong correlation with Edu.Exp, GNI, and Edu2.FM. Edu.Exp also has a positive correlation with GNI and Edu2.FM, and some positive correlation is found between Edu2.FM and GNI as well. As for the variables Labo.FM and Parli.F, these do not have notable correlations with the other variables.
PCA is a statistical procedure that decomposes a data matrix into a product of smaller matrices and reveals the most important components. In other words, the data are transformed into a new space with an equal or smaller number of dimensions. The first principal component captures the maximum amount of variance from the features in the original data; the second component captures the maximum amount of the variability that is left. The principal components are uncorrelated, and the amount of captured variance decreases for each successive component. In this section we run a PCA on the data shown in the following summary(). We use the function prcomp(), which employs the numerically accurate Singular Value Decomposition (SVD).
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
Here we have the summary() for the principal component analysis:
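A sketch of the PCA step (the object names pca_human1 and s1 match those used below):
pca_human1 <- prcomp(human_)   # PCA on the unstandardized data
s1 <- summary(pca_human1)
s1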
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000
## PC8
## Standard deviation 0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
Before proceeding to the plot, we compute the rounded percentage of variance captured by each principal component (keeping one digit) and create the axis labels:
pca_pr1 <- round(100*s1$importance[2,], digits = 1)
pca_pr1
pc_lab1 <- paste0(names(pca_pr1), " (", pca_pr1, "%)")  # axis labels (assumed construction; pc_lab1 is used in the biplot below)
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 100 0 0 0 0 0 0 0
Then we visualize the results with a biplot() that displays the observations and the original features (as a scatterplot), together with their relationships with each other and with the principal components (in the form of arrows and labels):
biplot(pca_human1, choices = 1:2, cex = c(1, 1.2), col = c("grey50", "deeppink2"), xlab =pc_lab1[1], ylab = pc_lab1[2])
The variability captured by the principal components is, after rounding, 100% for PC1 and 0% for PC2 and all the remaining PCs. This result is driven by the fact that the data were not standardized, as described below.
It is a good idea to standardize the data, because PCA is sensitive to the relative scaling of the original features: it assumes that features with larger variance are more important than features with smaller variance. By standardizing variables with different units we make their values comparable. So, we now repeat the same analysis with standardized variables, whose summary() is the following:
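A sketch of the standardization and the new PCA (human_std is my own name; pca_human2 matches the name used below):
human_std <- scale(human_)        # standardize the variables
summary(human_std)
pca_human2 <- prcomp(human_std)   # PCA on the standardized data
s2 <- summary(pca_human2)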
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7378 Min. :-2.7188
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6782 1st Qu.:-0.6425
## Median : 0.3503 Median : 0.2316 Median : 0.1140 Median : 0.3056
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.7126 3rd Qu.: 0.6717
## Max. : 2.6646 Max. : 1.6632 Max. : 2.4730 Max. : 1.4218
## GNI Mat.Mor Ado.Birth Parli.F
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
And here we have the summary() for the PCA:
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
## PC7 PC8
## Standard deviation 0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion 0.98702 1.00000
We again follow the same procedure as for the non-standardized data for rounding the percentages and creating the labels, and obtain these results:
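A sketch of this step (mirroring the unstandardized case; pc_lab2 is used in the biplot below):
pca_pr2 <- round(100*s2$importance[2,], digits = 1)
pca_pr2
pc_lab2 <- paste0(names(pca_pr2), " (", pca_pr2, "%)")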
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
biplot(pca_human2, choices = 1:2, cex = c(1, 1.2), col = c("grey50", "deeppink2"), xlab =pc_lab2[1], ylab = pc_lab2[2])
The results this time are clearly different. With the non-standardized data we had a PC1 that covered 100% of the variance and a PC2 at 0%, like the remaining six PCs. Here, using standardized data, PC1 covers 53.6% of the variance and PC2 covers 16.2%; the subsequent PCs capture 9.6%, 7.6%, 5.5%, 3.6%, 2.6%, and 1.3% respectively.
By choosing the first few principal components, we obtain uncorrelated variables that capture the maximum amount of variation in the data. In the PCA with standardized data, the variability captured is shown on the x and y axes: PC1 captures 53.6% of the variation and PC2 captures 16.2%. PCA is an unsupervised method, as there is no criterion or target variable; it is a powerful way to encapsulate the correlations between the original features into a smaller number of uncorrelated dimensions. As for the original variables, represented by the arrows, female representation in parliament (Parli.F) and Labo.FM are related and are the main contributors to PC2, while the maternal mortality rate and the adolescent birth rate are also related but contribute to PC1. Life.Exp, Edu.Exp, Edu2.FM, and GNI are related to each other and also contribute to PC1. These three groups are almost orthogonal, which indicates that they are uncorrelated.
Multiple Correspondence Analysis (MCA) can be used when the data consist of categorical variables, as in this tea dataset. The dim(), str(), and summary() of the dataset variables are shown below.
In this section we use the dataset tea from the FactoMineR package. I keep only some of the variables: "Tea", "How", "how", "sugar", "where", "lunch". Here are the dimensions, structure, and summary of the newly composed tea_time dataset:
install.packages("FactoMineR", repos="https://cloud.r-project.org/")
## package 'FactoMineR' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\iper\AppData\Local\Temp\RtmpSCKtpX\downloaded_packages
library(FactoMineR); library(dplyr); library(ggplot2); library(tidyr)
data(tea)
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- select(tea, one_of(keep_columns))
dim(tea_time)
## [1] 300 6
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
Here we proceed to visualize the tea_time dataset with the following code:
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
An MCA biplot shows the possible variable patterns; the distance between them gives a measure of their similarity. To run an MCA we use the following code:
library(MASS)
mca <- MCA(tea_time, graph = FALSE)
Here we can see the Scree plot:
## package 'factoextra' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\iper\AppData\Local\Temp\RtmpSCKtpX\downloaded_packages
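A sketch of how the scree plot can be drawn with factoextra (the plotting options are assumptions):
library(factoextra)
fviz_screeplot(mca, addlabels = TRUE)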
To visualize the MCA variables together with the individuals, I use a different procedure, because sometimes the biplot is a bit chaotic due to high concentrations of points. First I find the number of categories per variable:
cats = apply(tea_time, 2, function(x) nlevels(as.factor(x)))
cats
## Tea How how sugar where lunch
## 3 4 3 2 3 2
Then I obtain the table of eigenvalues and the variable and individual coordinates, which will form the data frames:
mca$eig
## eigenvalue percentage of variance cumulative percentage of variance
## dim 1 0.27937118 15.238428 15.23843
## dim 2 0.26092645 14.232352 29.47078
## dim 3 0.21933575 11.963768 41.43455
## dim 4 0.18943794 10.332978 51.76753
## dim 5 0.17722310 9.666715 61.43424
## dim 6 0.15617745 8.518770 69.95301
## dim 7 0.14375727 7.841306 77.79432
## dim 8 0.14126310 7.705260 85.49958
## dim 9 0.11717818 6.391537 91.89111
## dim 10 0.08660997 4.724180 96.61529
## dim 11 0.06205294 3.384706 100.00000
head(mca$var$coord)
## Dim 1 Dim 2 Dim 3 Dim 4 Dim 5
## black 0.47266156 0.09385258 -1.08063006 0.56186080 0.55465156
## Earl Grey -0.26425693 0.12347079 0.43288074 0.01902771 -0.21547686
## green 0.48559492 -0.93257436 -0.10846539 -1.37121354 0.01644904
## alone -0.01774999 -0.26156928 -0.11267800 -0.36538842 -0.39693332
## lemon 0.66914791 0.53068346 1.32935122 1.00939803 -0.37701185
## milk -0.33671028 0.27164647 0.01305056 0.25378949 1.57343677
head(mca$ind$coord)
## Dim 1 Dim 2 Dim 3 Dim 4 Dim 5
## 1 -0.2975376 -0.3278777 -0.3272610 0.3649751 -0.001711145
## 2 -0.2371102 -0.1360322 -0.6954324 0.2938098 0.818716502
## 3 -0.3689027 -0.3003457 -0.2015593 -0.1511548 -0.266254544
## 4 -0.5299062 -0.3182139 0.2113554 0.1571100 -0.306607147
## 5 -0.3689027 -0.3003457 -0.2015593 -0.1511548 -0.266254544
## 6 -0.3689027 -0.3003457 -0.2015593 -0.1511548 -0.266254544
mca_vars_df = data.frame(mca$var$coord, Variable = rep(names(cats),
cats))
mca_obs_df = data.frame(mca$ind$coord)
With these values I produced the following plot:
ggplot(data = mca_vars_df, aes(x = Dim.1, y = Dim.2, label = rownames(mca_vars_df))) + geom_hline(yintercept = 0, colour = "gray70") + geom_vline(xintercept = 0, colour = "gray70") + geom_text(aes(colour = Variable)) + ggtitle("MCA plot of variables using R package FactoMineR")
We can develop the plot to include observations and categories, as in an MCA biplot. Because the individuals overlap heavily, density curves are used instead to show the zones of high concentration:
ggplot(data = mca_obs_df, aes(x = Dim.1, y = Dim.2)) + geom_hline(yintercept = 0, colour = "gray70") + geom_vline(xintercept = 0, colour = "gray70") + geom_point(colour = "gray50",
alpha = 0.7) + geom_density2d(colour = "gray80") + geom_text(data = mca_vars_df,
aes(x = Dim.1, y = Dim.2, label = rownames(mca_vars_df), colour = Variable)) + ggtitle("MCA plot of variables using R package FactoMineR") + scale_colour_discrete(name = "Variable")
Here we see the summary() of the MCA, which will be commented on in the next section:
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.279 0.261 0.219 0.189 0.177 0.156
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.144 0.141 0.117 0.087 0.062
## % of var. 7.841 7.705 6.392 4.724 3.385
## Cumulative % of var. 77.794 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898
## cos2 v.test Dim.3 ctr cos2 v.test
## black 0.003 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 0.027 2.867 | 0.433 9.160 0.338 10.053 |
## green 0.107 -5.669 | -0.108 0.098 0.001 -0.659 |
## alone 0.127 -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 0.035 3.226 | 1.329 14.771 0.218 8.081 |
## milk 0.020 2.422 | 0.013 0.003 0.000 0.116 |
## other 0.102 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag 0.161 -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 0.478 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged 0.141 -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
In this output we can see the eigenvalues, in other words the variances and the percentages of variance retained by each dimension.
There are also the individuals' coordinates, their contributions to the dimensions expressed as percentages, and cos2, the squared correlations on the dimensions.
In the categories section, we see the coordinates of the variable categories, their contributions in percentage, the cos2, and the v.test value, which follows a normal distribution: if its absolute value is above 1.96, the coordinate is significantly different from zero. The categorical variables section (eta2) shows the squared correlation between each variable and the dimensions; a value close to one suggests a strong link between the variable and the dimension.
Comments on the findings and comparison of the results against the previous hypotheses.
Overall, for the chosen variables, only the females seem to present outliers. As for the whiskers, the females vary more than the males, except for the variable romantic. The only skewed plot is the one for romantic males. Let's now proceed to the comparison against the hypotheses.
"goout"and H1.Concerning the variable
"goout", I hypotesized (H1) that students who go out more experiences higher alcohol consumption. Overall, it seems that people who go out more, have higher consumptions of alcohol. The most starking differences are between class 1 and 3. But people at 2 have higher consumptions than people at 5.The higher consumption is registered at 3. So it is not self-evident that the more a student goes out, the more drinks, the hypothesis H1 is not entirely correct."freetime"and H2.In regards to the variable
"freetime", its related hypothesis (H2) was that students who have more free time are more prone to drink. Again, the levels for the answers 3 - 4 are much higher than 1 - 2 and 5. On a general level the hypothesis H2 seems correct. But the distributions of the results see a stark decrease at 5, which has even lower levels than at 2."studytime"and H3.When it comes to the findings of the variable
"studytime", it seems to corroborate the hypothesis (H3) according to which the more the studytime, the less a student drinks. The value in 2 is higher than 1 but the consumption decreases at the increasing of the studytime."romantic"and H4.Finally, to the variable
"romantic", I associated the hypothesis (H4) that romantics drink less than non-romantics, due to courting dynamics. The results seem to confirm the hypothesis H4. We see that non-romantics drink as much as twice compared to romantics, males percentage is higher than female one in this category, with a slightly larger gap in non-romantic.